Beyond keyword search for data sources on the World Wide WebBeyond keyword search for data sources on the World Wide Web
Matthew Michelson, C. Knoblock
One of the most important features of the World
Wide Web is its ability to empower users with lots of information.
However, much of this information is still unorganized and inaccessible
beyond a simple keyword search. For example, consider the auction site
EBay (http://www.ebay.com).
If a user wants to determine the average asking
price for an item, or count the occurrences, he or she would have to
submit a keyword search, retrieve all of the listings, filter out those
that do not apply, and determine item by item the answer to the
questions. Also, note that a keyword search would miss all of the items
that had misspelled the name. Clearly, keyword search is not powerful
enough to yield interesting answers about the data, especially if we
prefer to have programs determine these answers, rather than users.
It is preferable to embed within the EBay listings
a mechanism that allows them to be queried in a structured manner. This
way, determining the average price or count of an item would be a
simple, one line query that even a program could perform. This article
presents one such method to allow data sources, such as EBay, to
support structured queries. We call each of the listings a post and the
goal is to extract from each post the attributes embedded within that
post that describe the entity. This extraction is more formally known
as Information Extraction (IE).
Using our example from EBay, perhaps the posts are
about cars. In this case, the attributes would be descriptive pieces
about the car, such as the make, model or year, and performing IE on a
post would allow us to pick out all of the relevant attributes, even if
they are misspelled and regardless of their placing within the text of
the post. Once we extract the attributes, we can then add tags around
them, and query the data source using these tags as the query schema.
The addition of these tags is known as annotation. The overall approach
to annotation is shown in Figure 1, using an example post from EBay.
The different parts of this figure are described later in the article,
but it is useful to point out the examples of annotation at the
bottommost black box of the figure.
Figure 1. The points of level of similitude registration
Unstructured and ungrammatical data sources
In this article we focus on annotating data sources
that are unstructured and ungrammatical. By unstructured data sources
we mean data sources that vary from post to post in the order and
inclusion of attributes. One EBay post about a car might include a
model and year, in that order, while another would have year and make.
Since the order and inclusion of the attributes varies randomly within
a post, we can not exploit this structure, which would make the problem
easier.
Ungrammatical refers to the fact that most of the
time these posts do not conform to the rules of language. If they did,
we could exploit Natural Language Processing techniques to more easily
discover attributes. For instance, it might help to identify the nouns
in the post. However, since the post is not grammatical, it can not be
analyzed as a sentence.
Specifically, we focus on ungrammatical and
unstructured data sources because there has already been much research
on extracting attributes from semi-structured and structured data, such
as web pages, as well as data that conforms to the rules of a language,
such as extracting entities from news articles. IE on ungrammatical and
unstructured data sources poses a difficulty because without structure
or grammar there are few clues to identify which items in a post are
attributes, versus which items in the post are tokens that can be
ignored, which we call junk.
Reference Sets
Since we can not rely on the structure or grammar of the
data sources, we must infuse IE with outside knowledge to give it clues
as to which pieces of a post are attributes and which are junk.
This outside knowledge comes in the form of
reference sets. A reference set is a set of entities with the
associated attributes. It can be an online or offline database or an
online or offline set of documents. Using our cars example from EBay,
one reference set could be the Edmunds car website (http://www.edmunds.com)
which provides the attributes for car entities. The attributes in this
case are data such as make, model, year and trim. The information on
this website can be organized to look much like a database, where each
car entity has its associated attributes, and this database of cars
constitutes a reference set. Figure 1 includes part of the example
reference set from the Edmunds website.
A reference set is used by aligning each post to
the best matching member of the reference set. This alignment procedure
is known as Record Linkage, and is represented by the blue box of
Figure 1 with the label Record Linkage Step. The record linkage step
takes as input the reference set and the post, and outputs the best
possible match from the reference set, if one exists. This best
matching post then provides the clues necessary for doing IE because we
can search within the post for the attribute values from the reference
set member. How exactly the record linkage is done is the focus of the
next section.
Aligning Internet posts to reference sets
The record linkage step
Up to this point we have described what posts are
and what makes them unique, namely their unstructured and ungrammatical
nature. However, because of this lack of grammar and structure, we
realize we need some sort of outside knowledge in order to perform the
IE. This outside knowledge comes in the form of reference sets, and we
mentioned that we use the reference sets by discovering the record from
the reference set that best matches the post we are annotating.
Discovering this best match is known as record
linkage. Record linkage has been studied in the artificial intelligence
and database communities for a long time. Traditional record linkage
takes a record from one data source and finds its match in another.
This is done by examining the attributes from each data source and
deciding that the records that contain the most similar attributes
probably match. Figure 2 shows a good example of traditional record
linkage on two data sources of hotels.
>
Figure 2. An example of traditional registration.
Our record linkage difference
However, our record linkage is slightly different
and requires a new approach. A post is not yet decomposed into
attributes, so the traditional approaches to record linkage do not
apply. That is, the attributes to examine for similarity are embedded
within the post, so an attribute by attribute comparison is not
possible.
Instead, our method involves creating a vector of
similarity scores between the post and all of the attributes from the
reference set concatenated together. This way, we can approximate the
attribute by attribute similarity for all of the attributes at once,
which gives a similarity between the whole record and the post. This is
called record level similarity. So, for example, using the first
reference set member from Figure 1, the record level similarity scores
would be similarity scores between the post, 2001 VOLKSWAGEN JETTA GLS
VR6 EXTRA CLEAN FREE SHIPPING and the concatenated attributes Volkswagen Jetta 2001 4 Dr GLS VR6 Sedan.
However, there is not enough information in the
record level similarity to discern a match. Specifically, it is
possible that two records will share the same record level similarity
with a post, while differing on which attributes in the records caused
this similarity.
For example, consider Figure 3, which shows the
record linkage between a post about hotels and a reference set of
hotels. In the figure, each member of the reference set matches the
post on 2 attributes, with the hotel name in common. However, the first
matches on the hotel area while the second matches on the star. A hotel
area is more discriminative than a star rating, so we need someway to
reflect the similarities between the post and each attribute
individually. This is done by appending similarity scores between the
post and each attribute from the reference set so the vector of
similarity scores. These individual attribute comparisons give an
approximation of an attribute by attribute comparison, and are called
field level similarity. With our running example, we would generate
similarity scores between the post and Volkswagen, the post and Jetta,
and so on.
So, our total vector of similarities includes both
the notions of record level similarity, with the attribute
concatenation, and field level similarity with each attribute
individually. It should be noted that this vector of similarities is
not created for every record of the reference set. In fact, in record
linkage in general, the members of one data set are never compared to
all of the members of the other data set. Instead, just a subset of the
records, called candidates, is chosen first, and then these candidates
are used for the record linkage process.
The choosing of candidates for record linkage is
called blocking. The goal of blocking is to limit the size of the
candidate sets, without removing any true matches, as quickly as
possible. Our annotation technique is independent of the blocking
algorithm chosen. For example, we may simply say any reference set
member that shares a common token with the post is a candidate.
Once all of the candidate records from the
reference set have a vector of similarity scores, they are all rescored
in a binary fashion. Specifically, at each index of the similarity
vector, the candidates with the maximal value at that index change
their value to 1, and the rest of the candidates change their value to
0. For instance, given two candidate match vectors:
V1 = (0.1, 0.2, 0.5, …, 2.1)
V2 = (0.2, 0.1, 0.5, …, 0.8)
After binary rescoring, they would become:
V1 = (0, 1, 1, …, 1)
V2 = (1, 0, 1, …, 0)
This binary rescoring is done to emphasize the
difference in scores between vectors. Certain vectors will have indexes
with similarly close scores, but these will be exaggerated by the
binary rescoring, so the best match should be more easily detected.
The building and scoring of the similarity score
vectors is all a preprocessing step for the machine learning portion of
the record linkage step that will tell us which of the candidates is,
in fact, the best match to the post.
The machine learning portion of the record linkage
takes the candidate set (now a set of binary vectors) and labels them
as matches or non-matches. It can also return a confidence score
associated with the matches. Then, we can simply take the candidate
labeled as a match, with the highest confidence score, as the best
match. If we want to impose the idea that there is only one reference
set member for each post, then we throw out any candidates that have
the same maximal confidence score. This way, we ensure that only 1
candidate can ever match a post. If we want to relax this restriction,
we can pick any of the candidates with the maximal score as the best
match.
The actual machine learning technique used is known
as a Support Vector Machine (SVM). SVMs are used widely in artificial
intelligence research as an effective machine learning technique. The
key idea is to consider the vectors in n-dimensional space. If the
learning problem were easy, we could fit an n-dimensional plane, called
a hyperplane, between the vectors that are matches and those that are
non-matches. Then, depending on where a vector lies, we can determine
its class. Since we are unable to do this, the SVM maps each of the
vectors into a new n-dimensional space, where it is able to fit a
hyperplane that distinguishes the sets. It then determines whether a
vector is a match or non-match, depending on where it lies in relation
to this new hyperplane in the new feature space.
SVMs are a supervised machine learning technique,
which means they require labeled training data. Labeled training data
consists of pairs of posts and reference set members that have been
marked already as matches and non-matches. These allow the SVM to learn
how to fit the hyperplane to the data so that it can classify
previously unseen vectors.
Figure 3. To show the record linkage between a post about hotels and a reference set of hotels
Once the SVM has determined the best match to the
post, there is one final procedure in the record linage step. We append
annotation to the post that includes the attributes from the best
matching reference set member. In Figure 1, one of these attributes is
shown with the tags .
This is done for quite a few reasons. First, it gives standard values
upon which to query the data. If we use the actual extracted values to
query the data, we run into the same problem as keyword search, namely
that different misspellings would leave certain posts out of the query.
Second, this includes attributes that may not have been included by the
user. In our ongoing EBay car example, a post might include a model,
trim and year. If we query the posts on make, however, this post would
be left behind since it has no make in it explicitly. However, with the
reference set attributes annotated to it, it now has a make and would
be returned in the query. Last, there are certain attributes that are
extremely hard for IE. The extraction might perform poorly enough on
these attributes to not be useful for queries because it would either
return many incorrect results or not enough of the posts. However, if
the record linkage step did well, then we would have these attributes
from the reference set to use instead, so that attribute is still
useful to query the data.
Using the best match from the reference set for extraction
Once we have found the best matching member from
the reference set, we can exploit this member for IE. The intuition is
to take each token of the post and see if it matches any of the
attributes from the reference set. This is called the Information
Extraction step and is shown as the green box of figure 1.
The extraction method
Specifically, we take each token of the post and
create a vector of similarity scores between this token and each
attribute from the matching member of the reference set. This is
similar to creating the vector of similarity scores in the record
linkage step, except that we do not perform a binary rescoring on this
vector, since we do not need to pick a winner from a candidate set, as
before. Also, this vector contains a unique set of scores for comparing
the similarity of this token to special attributes called common
attributes.
In general, these common attributes are data that
are not easily represented by reference sets, yet they exhibit enough
identifying characteristics to exploit. Examples of common attributes
would be prices and dates. These more reliable characteristics allow
for the extraction of these data types using more traditional
techniques, such as regular expressions. So, we include scores for
matching a price regular expression, for instance. This score would
give a positive score for a match and 0 otherwise. This allows for the
extraction of useful attributes in the post that are easily extracted
with some expert rules.
Once we have created the vector of similarity
scores, we attempt to identify the attribute type of the token�s
vector by passing the vector to a multi-class SVM. Similar to the SVM
trained to label matches or non-matches, a multi-class SVM is able to
identify a token as a member of n classes. In our case, n-1 of the
classes are the attribute types, such as car make or car model, and the
nth class is junk, which means the token can be ignored.
Once all of the tokens in a post have been
identified as either belonging to an attribute or junk, we clean each
of the whole, extracted attributes from the post. The whole, extracted
attribute is just the concatenation of each of the tokens within the
post that have the same label. In the ongoing example, the tokens GLS
and VR6 would each be identified as a car trim, so the whole extracted
car trim would be GLS VR6. Intuitively, we need to label tokens in
isolation, because the data can be totally unstructured, so we cannot
exploit the structure of the post. However, this introduces noisy
tokens, that is, tokens that should have been labeled junk, but were
not.
To rectify this problem, we take each whole,
extracted attribute and compare it to its matching attribute from the
reference set member. We remove one token at a time from the extracted
attribute, and if this new attribute is a better match to the reference
set attribute than the old one, this token becomes a candidate for
removal. After processing each token in this manner we either remove
the candidate for removal that yielded the best match to the reference
set attribute, or we terminate the process because no tokens yielded a
better match. If we do not terminate, then we start the cycle again.
This process not only removes the noisy tokens from
the extracted attribute, but it has an added benefit of disambiguation.
If a token could easily belong to more than one attribute type, we
could label it as both of them and then let the cleaning procedure
remove the token from the attribute that it should not belong to.
This is the entire technique for extracting the
attributes from a post to build annotation that allows for structured
queries. Figure 4 depicts the entire extraction process.
A few observations should be noted about this
extraction procedure. First, since there are usually overlapping
attributes in a reference set, this approach to extraction is robust to
errors made in the record linkage step. If the record linkage step
fails to correctly identify a matching member of the reference set, the
match it does identify usually has enough similar information to make
it useful. For example, with the cars on EBay, if the record linkage
step returned a 2001 Volkswagen Jetta, but with the incorrect trim,
there would still be enough useful information to extract the make,
model and year of the car in the post. This is one reason that the
algorithm should not stop after the record linkage step, although that
might seem tempting, since most of the data is annotated with standard
values at that point.
Figure 4. The entire extraction process
Another reason to perform the extraction, beyond
the obvious desire to see the actual values users enter for attributes,
is that training the system to extract all of the attributes will help
it to extract common attributes in two ways. First, it makes junk a
rarer classification, which improves the accuracy of classifying junk.
Second, it is useful to be able to classify what something is not.
Consider the following example. Perhaps a car model is called the $35.
This is a strange case indeed, but if the system were not trained to
extract car models, this token would surely become classified as a
price. However, since the system can learn that it is a car model, it
is not a price.
Algorithm Observations; Performance
This algorithm has been empirically shown to
outperform other methods of record linkage and extraction for the task
of semantic annotation. More specifically, the running example
throughout this article is rather easy one, and the algorithm has been
shown to perform well on much more difficult data, for example, posts
with lots of missing attributes and varied attribute value
misspellings. Interested readers can read the scientific publications
presented at the end of this article, in the Further Reading section,
to see the actual performance numbers. Furthermore, the research
publications give a more advanced and detailed treatment of the
algorithms.
Another aspect of performance to consider with any
supervised learning method is the amount of training data that the
system requires to perform well. Again, the research publications show
that training on just 10% of the data (less than 80 examples in some
cases) does not cause the algorithm to degrade in performance. This is
beneficial, since labeling training data is tedious and costly.
One last observation regarding performance relates
to the reference sets. It would seem that if we were to use two
separate reference sets (since the algorithm is not tied to just one),
then we would have to use the cross product of the records in these two
reference sets as a single, large reference set. This would make the
overall performance much worse because of the huge size of the new
reference set. This is not the case. The algorithm easily supports the
use of reference sets in an iterative fashion, so as many reference
sets as are needed can be used. This, in fact, is a useful optimization
where smaller reference sets of unique reference attributes can be
formed and used in place of large, single reference sets.
Conclusion
This article presented a method by which to make
the ungrammatical and unstructured data of the World Wide Web much more
useful by annotating the data to support structured queries. These
structured queries move beyond simple keyword searches to turn data
into information. In the future, when web agents will act upon our
behalf, booking travel arrangements and bartering for goods, these
types of queries will become crucial to the decision making process.
The research presented here is a first step, and we hope that future
improvements will help bridge the gap between the time when users have
to tediously perform tasks on the World Wide Web and the time when
their agents do it for them.